355 research outputs found

OpenTED Browser: Insights into European Public Spendings

Author: Bontempi Gianluca
Le Borgne Yann-Aël
Homolova Adriana
Publication date: 01/01/2016

We present the OpenTED browser, a Web application for interactively browsing public spending data related to public procurements in the European Union. The application relies on Open Data recently published by the European Commission and the Publications Office of the European Union, from which we imported a curated dataset of 4.2 million contract award notices spanning the period 2006-2015. The application is designed to easily filter notices and visualise relationships between public contracting authorities and private contractors. The simple design makes it possible, for example, to quickly find out who the biggest suppliers of local governments are and what goods and services were contracted. We believe the tool, which we make Open Source, is a valuable source of information for journalists, NGOs, analysts and citizens seeking information on public procurement, from large-scale trends to local municipal developments. Comment: ECML, PKDD, SoGood workshop 201

arXiv.org e-Print Archive

DI-fusion

Feature selection in high-dimensional dataset using MapReduce

Author: Bontempi Gianluca
Le Borgne Yann-Aël
Reggiani Claudio
Publication date: 07/09/2017

This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features.

arXiv.org e-Print Archive

DI-fusion

From dependency to causality: a machine learning approach

Author: Bontempi Gianluca
Flauder Maxime
Publication date: 19/12/2014

The relationship between statistical dependency and causality lies at the heart of all statistical approaches to causal inference. Recent results in the ChaLearn cause-effect pair challenge have shown that causal directionality can be inferred with good accuracy also in Markov indistinguishable configurations thanks to data driven approaches. This paper proposes a supervised machine learning approach to infer the existence of a directed causal link between two variables in multivariate settings with n > 2 variables. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two variables causally connected. Our results show that supervised learning methods may be successfully used to extract causal information on the basis of asymmetric statistical descriptors also for n > 2 variate distributions. Comment: submitted to JML

arXiv.org e-Print Archive

DI-fusion

minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information

Author: Bontempi Gianluca
Lafitte Frédéric
Meyer Patrick E
Publication venue: BioMed Central
Publication date: 01/01/2008
Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

DI-fusion

Open Repository and Bibliography - Liège

Study of meta-analysis strategies for network inference using information-theoretic approaches

Author: Bellot Pujalte Pau
Bontempi Gianluca
Haibe-Kains Benjamin
Meyer Patrick E.
Pham Ngoc C.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2016

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Reverse engineering of gene regulatory networks (GRNs) from gene expression data is a classical challenge in systems biology. Thanks to high-throughput technologies, a massive amount of gene-expression data has accumulated in public repositories. Modelling GRNs from multiple experiments (also called integrative analysis) has therefore naturally become a standard procedure in modern computational biology. Indeed, such analysis is usually more robust than traditional approaches focused on individual datasets, which typically suffer from experimental bias and a small number of samples. To date, there are two main strategies for this problem: the first ("data merging") merges all datasets together and then infers a GRN, whereas the other ("networks ensemble") infers GRNs from every dataset separately and then aggregates them using some ensemble rules (such as ranksum or weightsum). Unfortunately, a thorough comparison of these two approaches is lacking. In this paper, we evaluate the performance of the meta-analysis approaches mentioned above with a systematic set of experiments based on in silico benchmarks. Furthermore, we present a new meta-analysis approach for inferring GRNs from multiple studies. Our proposed approach, adapted to methods based on pairwise measures such as correlation or mutual information, consists of two steps: aggregating the matrices of pairwise measures from every dataset, then extracting the network from the meta-matrix.

Peer Reviewed. Postprint (author's final draft)
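The two-step idea in this abstract (aggregate the pairwise-measure matrices, then extract a network from the meta-matrix) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' method: absolute Pearson correlation as the pairwise measure, an element-wise mean as the aggregation rule, a fixed threshold as the extraction step, and the hypothetical names `pairwise_matrix`, `meta_matrix` and `extract_network`.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (must not be constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def pairwise_matrix(dataset):
    """Map each gene pair to |r|; dataset is a dict gene -> expression values."""
    return {
        (g, h): abs(pearson(dataset[g], dataset[h]))
        for g, h in combinations(sorted(dataset), 2)
    }


def meta_matrix(matrices):
    """Step 1 (assumed rule): element-wise mean of the per-dataset matrices."""
    return {
        pair: sum(m[pair] for m in matrices) / len(matrices)
        for pair in matrices[0]
    }


def extract_network(matrix, threshold):
    """Step 2 (assumed rule): keep the pairs whose aggregated score passes a threshold."""
    return sorted(pair for pair, score in matrix.items() if score >= threshold)


# Two toy "studies" measuring the same three genes.
study_a = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, -1, -1, 1]}
study_b = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [1, -1, -1, 1]}

meta = meta_matrix([pairwise_matrix(study_a), pairwise_matrix(study_b)])
print(extract_network(meta, threshold=0.5))  # [('g1', 'g2')]
```

In practice the extraction step would be an inference method operating on the meta-matrix rather than a plain threshold; the threshold only keeps the sketch short.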

University of Toronto Research Repository

Crossref

UPCommons. Portal del coneixement obert de la UPC

Directory of Open Access Journals

DI-fusion

On the Impact of Entropy Estimation on Transcriptional Regulatory Network Inference Based on Mutual Information

Author: Bontempi Gianluca
Meyer Patrick E
Olsen Catharina
Publication venue: BioMed Central
Publication date: 01/01/2009
Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

DI-fusion

Open Repository and Bibliography - Liège

Adversarial Learning in Real-World Fraud Detection: Challenges and Perspectives

Author: Bontempi Gianluca
Caelen Olivier
Lunghi Daniele
Simitsis Alkis
Publication venue: Association for Computing Machinery (ACM)
Publication date: 03/07/2023

The data economy relies on data-driven systems, and complex machine learning applications are fueled by them. Unfortunately, machine learning models are exposed to fraudulent activities and adversarial attacks, which threaten their security and trustworthiness. In the last decade or so, research interest in adversarial machine learning has grown significantly, revealing how learning applications could be severely impacted by effective attacks. Although early results of adversarial machine learning indicate the huge potential of the approach in specific domains such as image processing, there is still a gap in both the research literature and practice regarding how to generalize adversarial techniques to other domains and applications. Fraud detection is a critical defense mechanism for the data economy, as it is for other applications, and it poses several challenges for machine learning. In this work, we describe how attacks against fraud detection systems differ from other applications of adversarial machine learning, and propose a number of interesting directions to bridge this gap.

arXiv.org e-Print Archive

Information-Theoretic Inference of Large Transcriptional Regulatory Networks

Author: Bontempi Gianluca
Kontos Kevin
Lafitte Frederic
Meyer Patrick E
Publication venue: BioMed Central
Publication date: 01/01/2007

The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists in selecting among the least redundant variables the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR, and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousands of genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.
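The MRMR principle described in this abstract can be illustrated with a small greedy selector. This is a sketch under stated assumptions, not the MRNET or minet implementation: it assumes discrete variables, a plug-in (empirical) mutual information estimate, and the common "relevance minus mean redundancy" scoring form; the names `mutual_information` and `mrmr_select` are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, target, k):
    """Greedily pick k feature names; features is a dict name -> discrete sequence."""
    selected = []
    remaining = set(features)

    def score(name):
        # Relevance: information shared with the target.
        relevance = mutual_information(features[name], target)
        if not selected:
            return relevance
        # Redundancy: mean information shared with already selected features.
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "b" duplicates "a"; "c" is weakly relevant but independent of "a".
target = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 0, 1, 1, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(mrmr_select(features, target, k=2))  # ['a', 'c']
```

The duplicate feature "b" is as relevant as "a" but fully redundant with it, so the weaker yet non-redundant "c" is preferred as the second pick.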

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

DI-fusion

Open Repository and Bibliography - Liège

Impact of filter feature selection on classification: an empirical study

Author: Abelló Gamazo Alberto
Bilalli Besim
Bontempi Gianluca
Njoku Uchechukwu Fortune
Publication venue: CEUR-WS.org
Publication date: 01/01/2022

The high dimensionality of Big Data poses challenges in data understanding and visualization. Furthermore, it leads to lengthy model building times in data analysis and poor generalization for machine learning models. Consequently, there is a need for feature selection, which identifies the more relevant part of the data to improve the data analysis (e.g., building simpler and more understandable models with reduced training time and improved model performance). This study aims to (i) characterize the factors (i.e., dataset characteristics) that influence the performance of feature selection methods, and (ii) assess the impact of feature selection on the training time and accuracy of binary and multiclass classification problems. As a result, we propose a systematic method to select representative datasets (i.e., considering the distributions of several dataset characteristics) in a given repository. Next, we provide an empirical study of the impact of eight feature selection methods on Naive Bayes (NB), Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), and Multilayer Perceptron (MLP) classification algorithms using 32 real-world datasets and a relative performance measure. We observed that feature selection is more effective in reducing training time (e.g., up to 60% for LDA classifiers) than improving classification accuracy (e.g., up to 5%). Furthermore, we observed that feature selection gives a slight accuracy improvement for binary classification (i.e., up to 5%), while it mostly leads to accuracy degradation for multiclass classification. Although none of the studied feature selection methods is best in all cases, for multiclass classification we observed that correlation-based and minimum redundancy maximum relevance feature selection methods gave the best accuracy. Through statistical testing, we found that LDA and MLP benefit more in accuracy from feature selection than KNN and NB.

The project leading to this publication has received funding from the European Commission under the European Union's Horizon 2020 research and innovation programme (grant agreement No 955895).

Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC